Abstract

Bayes statistics and statistical physics share a common mathematical structure, in which the log likelihood function corresponds to the random Hamiltonian. Recently, it was discovered that the asymptotic learning curves in Bayes estimation are subject to a universal law, even if the log likelihood function cannot be approximated by any quadratic form. However, it has remained unknown what mathematical property ensures such a universal law. In this paper, we define a renormalizable condition of the statistical estimation problem, and show that, under such a condition, the asymptotic learning curves are ensured to be subject to the universal law, even if the true distribution is unrealizable and singular for a statistical model. We also study a nonrenormalizable case, in which the learning curves have asymptotic behaviors different from the universal law.

1 Introduction

In recent studies, it was pointed out that Bayes statistics and statistical physics share a common mathematical structure, where the log likelihood function plays the same role as the random Hamiltonian, and the Bayes posterior distribution can be understood as the Boltzmann distribution. However, there are some differences between them. In statistical learning theory, the random Hamiltonian cannot necessarily be approximated by any quadratic form because the Hessian matrix of the log likelihood function can be singular [1]. For example, artificial neural networks [2], normal mixtures [3], reduced rank regressions [4], Bayes networks [5], binomial mixtures, Boltzmann machines, and hidden Markov models are singular models.

The statistical properties of such models have been left unknown in statistics and information 
science, because it was difficult to analyze a singular likelihood function [HE]. Recently, new 
statistical learning theory has been established based on algebraic geometry, by which it was 
proved that the generalization and training errors are subject to a universal law, even if the 
statistical model does not satisfy the regularity condition EJ EJ [10], [11], [12]. However, it has not yet been clarified what mathematical properties ensure that such a universal law holds; therefore, the range of statistical problems that are subject to the universal law is unknown.

In this paper, we define a renormalizable condition of a statistical problem. The renormal- 
izable condition requires that the variance function of the random Hamiltonian is bounded by 
the average one. We show that, if a statistical problem is renormalizable, then the algebraic geometrical method can be successfully applied, with the result that the learning curves are subject to the universal law. We also show that, if it is not renormalizable, then the large fluctuation of the random Hamiltonian prevents the system from obeying the universal law in general.

2 Bayes Learning Theory



Let $q(x)$ be a probability density function on the $N$-dimensional real Euclidean space $\mathbb{R}^N$. The training samples and the testing sample are respectively defined by random variables $X_1, X_2, \ldots, X_n$ and $X$, which are independently subject to the same probability distribution $q(x)dx$.

A statistical model is defined as a probability density function $p(x|w)$ of $x \in \mathbb{R}^N$ for a given parameter $w \in W \subset \mathbb{R}^d$, where $W$ is the set of all parameters. In Bayes estimation, we prepare a probability density function $\varphi(w)$ on $W$. Although $\varphi(w)$ is called a prior distribution, it does not necessarily represent a priori knowledge of the parameter, in general.

For a given function $F(w)$ on $W$, its expectation value $\langle F(w) \rangle$ with respect to the posterior distribution is defined by

$$\langle F(w) \rangle = \frac{\int F(w) \prod_{i=1}^{n} p(X_i|w)^{\beta}\, \varphi(w)\,dw}{\int \prod_{i=1}^{n} p(X_i|w)^{\beta}\, \varphi(w)\,dw},$$

where $0 < \beta < \infty$ is the inverse temperature. The case $\beta = 1$ is most important because it corresponds to the strict Bayes estimation. The Bayes predictive distribution is defined by

$$p^*(x) = \langle p(x|w) \rangle.$$

In Bayes learning theory, the following random variables play an important role. The Bayes generalization loss $B_g$, the Bayes training loss $B_t$, the Gibbs generalization loss $G_g$, and the Gibbs training loss $G_t$ are respectively defined by

$$B_g = -E_X[\log \langle p(X|w) \rangle], \tag{1}$$
$$B_t = -\frac{1}{n}\sum_{i=1}^{n} \log \langle p(X_i|w) \rangle, \tag{2}$$
$$G_g = -\langle E_X[\log p(X|w)] \rangle, \tag{3}$$
$$G_t = -\Big\langle \frac{1}{n}\sum_{i=1}^{n} \log p(X_i|w) \Big\rangle, \tag{4}$$

where $E_X[\ ]$ shows the expectation value over $X$. Let us introduce two random variables by

$$Y_g = E_X[\langle (\log p(X|w))^2 \rangle - \langle \log p(X|w) \rangle^2], \tag{5}$$
$$Y_t = \frac{1}{n}\sum_{i=1}^{n} \{\langle (\log p(X_i|w))^2 \rangle - \langle \log p(X_i|w) \rangle^2\}, \tag{6}$$

where $V_g = nY_g$ and $V_t = nY_t$ are referred to as the functional variances [9] [10]. In this paper, we study the expectation values of these six random variables, which are called Bayes observables. The log loss function $L(w)$ and the entropy $S$ are respectively defined by

$$L(w) = -E_X[\log p(X|w)],$$
$$S = -E_X[\log q(X)].$$

Note that $L(w) = S + D(q\|p_w)$, where $D(q\|p_w)$ is the relative entropy or Kullback-Leibler distance defined by

$$D(q\|p_w) = \int q(x) \log \frac{q(x)}{p(x|w)}\,dx.$$

Therefore, $L(w) \geq S$ always holds. Moreover, $L(w) = S$ if and only if $p(x|w) = q(x)$. In this paper, we assume that there exists a parameter $w_0 \in W$ which minimizes $L(w)$,

$$L(w_0) = \min_{w \in W} L(w).$$

Note that such $w_0$ is not unique in general, because the map $w \mapsto p(x|w)$ is not one-to-one in general. We assume that, for an arbitrary $w$ that satisfies $L(w) = L(w_0)$, $p(x|w)$ is the same probability density function. Let $p_0(x)$ be such a unique probability density function. For simplicity, we use the notation $L_0 = -E_X[\log p_0(X)]$.

Definition. If $q(x) = p_0(x)$, then $q(x)$ is said to be realizable by $p(x|w)$; otherwise, it is said to be unrealizable.

Definition. If the set $W_0 = \{w \in W;\ p_0(x) = p(x|w)\}$ consists of a single point $w_0$ and if the Hessian matrix $J = \nabla^2 L(w_0)$ is strictly positive definite, then $q(x)$ is said to be regular for $p(x|w)$; otherwise, $q(x)$ is said to be singular for $p(x|w)$.

Bayes learning theory was studied in realizable and regular cases [13], [14], [15], realizable and singular cases [7] [9] [10], and unrealizable and regular cases [11]. In such cases, it was proved that there exists a universal relation between the generalization and training errors. In this paper, we mainly study unrealizable and singular cases.



3 Generating Function of Statistical Learning 



The log density ratio function $f(x, w)$ and the log likelihood ratio function $H_n(w)$ are respectively defined by

$$f(x, w) = \log \frac{p_0(x)}{p(x|w)},$$
$$H_n(w) = \frac{1}{n}\sum_{i=1}^{n} f(X_i, w),$$

where $nH_n(w)$ is referred to as the random Hamiltonian. In this paper, we introduce the generating function of Bayes learning theory by

$$F_n(\alpha) = -E\Big[\log \int \exp(-\alpha f(X, w) - \beta n H_n(w))\,\varphi(w)\,dw\Big],$$



where $E[\ ]$ shows the expectation value over $X_1, X_2, \ldots, X_n$ and $X$. Then, by the definitions eq.(1)-eq.(6) and by using the fact that $\log p_0(x)$ is a constant function of $w$, it immediately follows that

$$E[B_g] = L_0 + F_n(1) - F_n(0), \tag{7}$$
$$E[B_t] = L_0 + F_{n-1}(1 + \beta) - F_{n-1}(\beta), \tag{8}$$
$$E[G_g] = L_0 + F_n'(0), \tag{9}$$
$$E[G_t] = L_0 + F_{n-1}'(\beta), \tag{10}$$
$$E[Y_g] = -F_n''(0), \tag{11}$$
$$E[Y_t] = -F_{n-1}''(\beta). \tag{12}$$



These equations show that $F_n(\alpha)$ determines the behaviors of average Bayes observables [7] [14] [15]. In order to analyze these values, we need assumptions.

Definition. If there exists a constant $\gamma > 0$ such that

$$\lim_{n\to\infty} \sup_{0<\alpha<1+\beta} |F_n^{(3)}(\alpha)|\,n^{\gamma} = 0, \tag{13}$$
$$\lim_{n\to\infty} |F_n'(0) - F_{n-1}'(0)|\,n^{\gamma} = 0, \tag{14}$$
$$\lim_{n\to\infty} |F_n''(0) - F_{n-1}''(0)|\,n^{\gamma} = 0, \tag{15}$$

then the generating function is said to satisfy the conditions of learnability with index $\gamma$.

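Let us assume that the conditions of learnability are satisfied. Then, by using Taylor expansions of $F_n(\alpha)$, $F_n'(\alpha)$, and $F_n''(\alpha)$, it follows that

$$E[B_g] = L_0 + F_n'(0) + \frac{1}{2}F_n''(0) + o\Big(\frac{1}{n^{\gamma}}\Big), \tag{16}$$
$$E[B_t] = L_0 + F_n'(0) + \frac{2\beta+1}{2}F_n''(0) + o\Big(\frac{1}{n^{\gamma}}\Big), \tag{17}$$
$$E[G_g] = L_0 + F_n'(0), \tag{18}$$
$$E[G_t] = L_0 + F_n'(0) + \beta F_n''(0) + o\Big(\frac{1}{n^{\gamma}}\Big), \tag{19}$$
$$E[Y_g] = -F_n''(0), \tag{20}$$
$$E[Y_t] = -F_n''(0) + o\Big(\frac{1}{n^{\gamma}}\Big). \tag{21}$$

Therefore, we obtain the equations of states in statistical learning,

$$E[B_g] = E[B_t] + \beta E[Y_t] + o\Big(\frac{1}{n^{\gamma}}\Big), \tag{22}$$
$$E[G_g] = E[G_t] + \beta E[Y_t] + o\Big(\frac{1}{n^{\gamma}}\Big). \tag{23}$$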

That is to say, if the conditions of learnability are satisfied, then the equations of states hold. Minimization of both $E[B_g]$ and $E[G_g]$ is one of the main purposes of statistical estimation; however, they need the expectation value $E_X[\ ]$ over the testing sample, hence they cannot be calculated directly from training samples. On the other hand, $B_t$, $G_t$, and $Y_t$ can be calculated from only training samples without any direct information about $q(x)$. In other words, the equations of states show that $E[B_g]$ and $E[G_g]$ can be estimated from training samples; therefore $B_t + \beta Y_t$ and $G_t + \beta Y_t$ are information criteria which show how appropriate the set $(p(x|w), \varphi(w))$ is. In fact, they are equal to AIC [16] if $q(x)$ is realizable by and regular for $p(x|w)$. If $q(x)$ is unrealizable by or singular for $p(x|w)$, then AIC is not equal to the asymptotic generalization error, whereas $B_t + \beta Y_t$ and $G_t + \beta Y_t$ are. Hence they are called widely applicable information
criteria (WAIC) PCEICE]. 
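
As a concrete illustration of how $B_t$, $G_t$, and $Y_t$ are computed from training samples alone, the following Python sketch evaluates them for a one-dimensional normal model with a conjugate normal prior, and then forms the two criteria $B_t + \beta Y_t$ and $G_t + \beta Y_t$. The model, the prior variance, the number of posterior draws, and all function names are assumptions made for this illustration only; they are not taken from the paper.

```python
import math
import random


def log_normal_pdf(x, mean):
    """Log density of the assumed model N(mean, 1) at x."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - mean) ** 2


def bayes_training_observables(xs, beta=1.0, prior_var=100.0, draws=2000, seed=0):
    """Monte Carlo estimates of B_t, G_t and Y_t from posterior draws."""
    rng = random.Random(seed)
    n = len(xs)
    # Tempered posterior of the mean under a N(0, prior_var) prior (conjugate case).
    precision = beta * n + 1.0 / prior_var
    post_mean = beta * sum(xs) / precision
    post_sd = math.sqrt(1.0 / precision)
    ws = [rng.gauss(post_mean, post_sd) for _ in range(draws)]

    b_t = g_t = y_t = 0.0
    for x in xs:
        logs = [log_normal_pdf(x, w) for w in ws]
        top = max(logs)
        # log <p(x|w)> computed with the log-sum-exp trick for stability
        log_mean_p = top + math.log(sum(math.exp(v - top) for v in logs) / draws)
        mean_log = sum(logs) / draws
        mean_sq = sum(v * v for v in logs) / draws
        b_t -= log_mean_p / n
        g_t -= mean_log / n
        y_t += (mean_sq - mean_log ** 2) / n
    return b_t, g_t, y_t


data_rng = random.Random(1)
data = [data_rng.gauss(0.3, 1.0) for _ in range(50)]  # synthetic training samples
beta = 1.0
b_t, g_t, y_t = bayes_training_observables(data, beta=beta)
waic_bayes = b_t + beta * y_t  # estimates E[B_g]
waic_gibbs = g_t + beta * y_t  # estimates E[G_g]
print(b_t, g_t, y_t, waic_bayes, waic_gibbs)
```

No quantity in this sketch requires $q(x)$: everything is evaluated at the training samples, which is the point made by the equations of states.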



4 Renormalizable Case 

Let us define the renormalizability. 

Definition. Let $W_\epsilon = \{w \in W;\ D(p_0\|p_w) < \epsilon\}$. If there exist $A > 0$ and $\epsilon > 0$ such that

$$w \in W_\epsilon \implies L(w) - L_0 \geq A\,D(p_0\|p_w),$$

then the pair $(q(x), p(x|w))$ is said to be renormalizable; otherwise, it is said to be nonrenormalizable.

It is easy to show that, if $q(x)$ is regular for $p(x|w)$, then $(q(x), p(x|w))$ is renormalizable. In fact, $D(p_0\|p_w)$ is smaller than some quadratic form of $w - w_0$ in the neighborhood of the unique $w_0$, and $L(w) - L_0$ has a positive definite Hessian matrix. Also, it is trivial to show that, if $q(x)$ is realizable by $p(x|w)$, then $(q(x), p(x|w))$ is renormalizable. In fact, since $q(x) = p_0(x)$, $L(w) - L_0 = D(p_0\|p_w)$. However, if $q(x)$ is unrealizable by and singular for $p(x|w)$, then $(q(x), p(x|w))$ may be renormalizable or nonrenormalizable.

In this section, we study the renormalizable case, and show that the conditions of learnability hold with index $\gamma = 1$ and that the Bayes observables are subject to the universal law.

We assume that $L(w)$ is an analytic function of $w \in W$ and that $w \mapsto f(x, w)$ is a function-valued analytic function. Since $\int p_0(x)dx = \int p_w(x)dx = 1$,

$$D(p_0\|p_w) = \int p_0(x)\,(f(x, w) + e^{-f(x, w)} - 1)\,dx.$$

There exists a constant $B > 0$ such that

$$\frac{t + e^{-t} - 1}{t^2} \geq B \quad (|t| < \epsilon).$$

By combining this inequality with the renormalizability, it follows that

$$L(w) - L_0 \geq AB \int p_0(x)\,f(x, w)^2\,dx. \tag{24}$$

Since $L(w) - L_0$ is an analytic function, we can apply resolution of singularities [17] [19] to $L(w) - L_0$, and obtain the following result. There exist both a real $d$-dimensional analytic manifold $\mathcal{M}$ and a real analytic map $g : \mathcal{M} \to W$ such that, in each local coordinate of $\mathcal{M}$,

$$L(g(u)) - L_0 = u^{2k} = \prod_{j=1}^{d} u_j^{2k_j},$$
$$|g'(u)|\,\varphi(g(u)) = b(u)\,u^{h} = b(u)\prod_{j=1}^{d} u_j^{h_j},$$

where $k = (k_1, k_2, \ldots, k_d)$ and $h = (h_1, h_2, \ldots, h_d)$ are multi-indices made of nonnegative integers, $|g'(u)|$ is the Jacobian determinant of the map $w = g(u)$, and $b(u) > 0$. Then, by using eq.(24), $f(x, g(u))^2$ can be divided by $u^{2k}$; in other words, $f(x, g(u))/u^k$ is a well-defined analytic function. In fact, if $f(x, g(u))^2$ cannot be divided by $u^{2k}$, then eq.(24) does not hold. Hence, there exists a function-valued analytic function $a(x, u)$ such that

$$f(x, g(u)) = a(x, u)\,u^k.$$

Moreover, from $L(w) - L_0 = E_X[f(X, w)]$, we have $E_X[a(X, u)] = u^k$. Remark that both renormalizability and the resolution theorem are necessary to prove the existence of $a(x, u)$. Let us define an empirical process on $\mathcal{M}$,

$$\xi_n(u) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \{u^k - a(X_i, u)\}.$$

Then the probability distribution of $\xi_n(u)$ converges to that of the gaussian process $\xi(u)$, which is uniquely determined by its average and covariance [9] [TH],

$$E_\xi[\xi(u)] = 0,$$
$$E_\xi[\xi(u)\xi(u')] = E_X[a(X, u)a(X, u')] - E_X[a(X, u)]\,E_X[a(X, u')],$$

where $E_\xi[\ ]$ shows the expectation value over the gaussian process $\xi(u)$. Moreover, the gaussian process $\xi(u)$ can be represented by

$$\xi(u) = \sum_{j=1}^{\infty} c_j(u)\,g_j,$$

where $\{g_j\}$ are independent random variables and each $g_j$ is subject to the standard normal distribution. Then

$$E_\xi[\xi(u)\xi(u')] = \sum_{j=1}^{\infty} c_j(u)\,c_j(u').$$

The random Hamiltonian is rewritten as

$$nH_n(g(u)) = n u^{2k} - \sqrt{n}\,u^k\,\xi_n(u).$$

To study the generating function $F_n(\alpha)$, we need the asymptotic behavior of

$$Z_n(s) = \int f(x, w)^s \exp(-\beta n H_n(w))\,\varphi(w)\,dw,$$

where $s \geq 0$ is a real value. For example,



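$$F_n'(0) = E\Big[\frac{Z_n(1)}{Z_n(0)}\Big], \tag{25}$$
$$F_n''(0) = -E\Big[\frac{Z_n(2)}{Z_n(0)}\Big] + E\Big[\Big(\frac{Z_n(1)}{Z_n(0)}\Big)^2\Big]. \tag{26}$$

Then by using the function $w = g(u)$,

$$Z_n(s) = \sum_\alpha \int du\ a(x, u)^s\,u^{sk+h} \exp(-\beta n u^{2k} + \beta\sqrt{n}\,u^k\,\xi_n(u))\,b_\alpha(u)$$
$$\quad = \sum_\alpha \int_0^\infty dt \int du\ \frac{1}{n}\,\delta\Big(\frac{t}{n} - u^{2k}\Big)\,a(x, u)^s\,u^{sk+h} \exp(-\beta t + \beta\sqrt{t}\,\xi_n(u))\,b_\alpha(u),$$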



where $\sum_\alpha$ shows the sum over all local coordinates and $b_\alpha(u) \geq 0$ satisfies $\sum_\alpha b_\alpha(u) = b(u)$. By using the asymptotic expansion of the Schwartz distribution $\delta(t/n - u^{2k})$ for $n \to \infty$ [3] EJ [10] [20] E2J [25], there exists a Schwartz distribution $D_\alpha(u)$ such that

$$\sum_\alpha \delta\Big(\frac{t}{n} - u^{2k}\Big)\,u^{sk+h}\,b_\alpha(u) \cong \frac{(\log n)^{m-1}}{n^{\lambda - 1 + s/2}}\; t^{\lambda - 1 + s/2}\,\Big(\sum_{\alpha^*} D_{\alpha^*}(u)\Big),$$

where $\lambda > 0$ is the log canonical threshold defined by

$$\lambda = \min_\alpha \min_{1 \leq j \leq d} \Big(\frac{h_j + 1}{2k_j}\Big),$$

and $m$ is the maximum number of $j$ which attains the above minimum. Also $\sum_{\alpha^*}$ shows the sum over all local coordinates that attain the above minimum, and the support of $D_{\alpha^*}(u)$ is contained in the set $\{u \in \mathcal{M};\ L(g(u)) - L_0 = 0\}$. Hence



$$Z_n(s) \cong \frac{(\log n)^{m-1}}{n^{\lambda + s/2}} \int D(u, t)\ t^{s/2}\,a(x, u)^s \exp(\beta\sqrt{t}\,\xi_n(u)),$$

where $\int D(u, t)$ is defined by the integration over the manifold,

$$\int D(u, t) = \sum_{\alpha^*} \int_0^\infty dt \int du\ D_{\alpha^*}(u)\,t^{\lambda - 1} \exp(-\beta t).$$

Let us define

$$Z(q, r, s) = \int D(u, t)\ \xi(u)^q\,t^{r/2}\,a(x, u)^s \exp(\beta\sqrt{t}\,\xi(u)).$$

Then

$$Z_n(s) \cong \frac{(\log n)^{m-1}}{n^{\lambda + s/2}}\,Z(0, s, s). \tag{27}$$
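
The exponent $\lambda$ and the multiplicity $m$ that control eq.(27) are determined chart by chart from the multi-indices $k$ and $h$. A minimal Python sketch of that bookkeeping follows; the function names and the chart data are hypothetical values chosen only for illustration, and coordinates with $k_j = 0$ are assumed not to contribute to the minimum.

```python
from fractions import Fraction


def chart_threshold(k, h):
    """Return (lambda, m) for one local chart with multi-indices k and h."""
    # Coordinates with k_j = 0 are skipped: u_j^{2 k_j} = 1 there.
    ratios = [Fraction(hj + 1, 2 * kj) for kj, hj in zip(k, h) if kj > 0]
    if not ratios:
        raise ValueError("L(g(u)) - L_0 does not vanish in this chart")
    lam = min(ratios)
    return lam, ratios.count(lam)


def log_canonical_threshold(charts):
    """Minimum of lambda over charts, with the largest multiplicity attaining it."""
    results = [chart_threshold(k, h) for k, h in charts]
    lam = min(r[0] for r in results)
    m = max(r[1] for r in results if r[0] == lam)
    return lam, m


# Hypothetical charts given as (k, h) pairs
charts = [((1, 1), (0, 0)), ((2, 1), (0, 0))]
lam, m = log_canonical_threshold(charts)
print(lam, m)
```

Exact rational arithmetic is used so that ties between coordinates, which determine $m$, are detected without floating-point comparison.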



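Firstly, since $E_X[a(X, u)] = u^k$, and $u^k = (t/n)^{1/2}$ under the delta function $\delta(t/n - u^{2k})$,

$$E_X[Z(0, 1, 1)] = \frac{1}{\sqrt{n}}\,Z(0, 2, 0).$$

Secondly, by using the partial integration over $t$,

$$\int_0^\infty dt\ t^{\lambda}\,e^{-\beta t + \beta\sqrt{t}\,\xi(u)} = \frac{\lambda}{\beta}\int_0^\infty dt\ t^{\lambda - 1}\,e^{-\beta t + \beta\sqrt{t}\,\xi(u)} + \frac{1}{2}\int_0^\infty dt\ t^{\lambda - 1/2}\,\xi(u)\,e^{-\beta t + \beta\sqrt{t}\,\xi(u)},$$

it follows that

$$Z(0, 2, 0) = \frac{\lambda}{\beta}\,Z(0, 0, 0) + \frac{1}{2}\,Z(1, 1, 0).$$

And lastly, by using the partial integration over the gaussian process $\xi(u)$,

$$E_\xi\Big[\frac{Z(1, 1, 0)}{Z(0, 0, 0)}\Big] = E_\xi\Big[\sum_{j=1}^{\infty} \int D(u, t)\,c_j(u)\,\frac{\partial}{\partial g_j}\Big(\frac{t^{1/2}\exp(\beta\sqrt{t}\,\xi(u))}{\int D(u', t')\exp(\beta\sqrt{t'}\,\xi(u'))}\Big)\Big]$$
$$\quad = \beta E_X E_\xi\Big[\frac{Z(0, 2, 2)}{Z(0, 0, 0)}\Big] - \beta E_X E_\xi\Big[\Big(\frac{Z(0, 1, 1)}{Z(0, 0, 0)}\Big)^2\Big], \tag{28}$$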



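where we used $E_\xi[\xi(u)\xi(u')] = E_X[a(X, u)a(X, u')]$ on the set $\{u;\ L(g(u)) - L_0 = 0\}$. Let us define the constant $2\nu$ by the right hand side of eq.(28), where $\nu$ is referred to as the singular fluctuation. Then by using eqs.(25), (26), (27),

$$F_n'(0) = \Big(\frac{\lambda}{\beta} + \nu\Big)\frac{1}{n} + o\Big(\frac{1}{n}\Big),$$
$$F_n''(0) = -\frac{2\nu}{\beta}\,\frac{1}{n} + o\Big(\frac{1}{n}\Big).$$

Therefore, we obtained the universal law of Bayes observables

$$E[B_g] = L_0 + \Big(\frac{\lambda - \nu}{\beta} + \nu\Big)\frac{1}{n} + o\Big(\frac{1}{n}\Big), \tag{29}$$
$$E[B_t] = L_0 + \Big(\frac{\lambda - \nu}{\beta} - \nu\Big)\frac{1}{n} + o\Big(\frac{1}{n}\Big), \tag{30}$$
$$E[G_g] = L_0 + \Big(\frac{\lambda}{\beta} + \nu\Big)\frac{1}{n} + o\Big(\frac{1}{n}\Big), \tag{31}$$
$$E[G_t] = L_0 + \Big(\frac{\lambda}{\beta} - \nu\Big)\frac{1}{n} + o\Big(\frac{1}{n}\Big), \tag{32}$$
$$E[Y_g] = \frac{2\nu}{\beta}\,\frac{1}{n} + o\Big(\frac{1}{n}\Big). \tag{33}$$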
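
The leading $1/n$ terms of the universal law can be tabulated directly from $\lambda$, $\nu$, $\beta$, and $n$. The following Python sketch does so and checks that the equations of states (22) and (23) hold to leading order, using the fact that $E[Y_t]$ and $E[Y_g]$ agree to that order by eqs.(20) and (21). The function name and the numerical values of $\lambda$ and $\nu$ are assumptions chosen only for illustration.

```python
import math


def universal_law(lam, nu, beta, n):
    """Leading 1/n terms of E[.] - L_0 for the four losses, and of E[Y_g]."""
    return {
        "bayes_gen": ((lam - nu) / beta + nu) / n,
        "bayes_train": ((lam - nu) / beta - nu) / n,
        "gibbs_gen": (lam / beta + nu) / n,
        "gibbs_train": (lam / beta - nu) / n,
        "variance": 2.0 * nu / (beta * n),
    }


beta = 2.0
law = universal_law(lam=0.75, nu=0.6, beta=beta, n=500)  # hypothetical lambda, nu

# Equations of states: generalization = training + beta * E[Y]
gap_bayes = law["bayes_gen"] - law["bayes_train"]
gap_gibbs = law["gibbs_gen"] - law["gibbs_train"]
print(gap_bayes, gap_gibbs, beta * law["variance"])
```

Both gaps equal $2\nu/n$, independently of $\lambda$, which is why the correction term $\beta Y_t$ in the information criteria does not require knowledge of the log canonical threshold.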



In this case, we can prove that the conditions of learnability with index $\gamma = 1$ are satisfied in the same way as [10]. Hence, the equations of states hold with $\gamma = 1$.

5 Nonrenormalizable Case



In this section, we study a nonrenormalizable case. It is still difficult to clarify the general 
nonrenormalizable case. Hence, in this section, we show that there exists a simple example in 
which the Bayes observables do not satisfy the universal law. 

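$$q(x, y) = \frac{1}{2\pi}\exp\Big(-\frac{1}{2}(x^2 + y^2)\Big), \tag{34}$$
$$p(x, y|a) = \frac{1}{2\pi}\exp\Big(-\frac{1}{2}\Big\{(x - a)^2 + \big(y - \sqrt{a^4 - a^2 + 1}\big)^2\Big\}\Big), \tag{35}$$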

where $a \in \mathbb{R}^1$ is a parameter. Then the relative entropy is

$$D(q\|p_a) = \int\!\!\int q(x, y) \log \frac{q(x, y)}{p(x, y|a)}\,dx\,dy = \frac{1}{2}(a^4 + 1).$$

Hence $D(q\|p_a)$ is minimized at $a = 0$, and $L_0 = \log(2\pi) + 3/2$. The Hessian is given by $\partial_a^2 D(q\|p_a)|_{a=0} = 0$. Therefore $q(x, y)$ is unrealizable by and singular for $p(x, y|a)$. The log density ratio function is

$$f(x, y, a) = -ax - h(a)\,y + \frac{a^4}{2},$$

where $h(a) = \sqrt{a^4 - a^2 + 1} - 1$ is a real analytic function, and

$$D(p_0\|p_a) = \frac{a^4}{2} - h(a).$$

Note that $D(p_0\|p_a) \cong a^2/2$ in the neighborhood of $a = 0$. On the other hand, $L(a) - L_0 = a^4/2$, so that $(q(x, y), p(x, y|a))$ is not renormalizable. The random Hamiltonian is

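$$nH_n(a) = \frac{n a^4}{2} - \sqrt{n}\,a\,\xi_1 - \sqrt{n}\,h(a)\,\xi_2,$$

where

$$\xi_1 = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} X_i, \qquad \xi_2 = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} Y_i$$

are independently subject to the standard normal distribution. The first and second derivatives with respect to $a$ are

$$nH_n(a)' = 2na^3 - \sqrt{n}\,\xi_1 - \sqrt{n}\,h'(a)\,\xi_2,$$
$$nH_n(a)'' = 6na^2 - \sqrt{n}\,h''(a)\,\xi_2.$$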
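
The failure of the renormalizable condition in this example can be seen numerically: the ratio $(L(a) - L_0)/D(p_0\|p_a)$ tends to zero as $a \to 0$, so no constant $A > 0$ can bound it from below. The following Python sketch evaluates this ratio from the closed forms $L(a) - L_0 = a^4/2$ and $D(p_0\|p_a) = a^4/2 - h(a)$; the function names and the sample values of $a$ are choices made for this illustration.

```python
import math


def h(a):
    """h(a) = sqrt(a^4 - a^2 + 1) - 1."""
    return math.sqrt(a**4 - a**2 + 1.0) - 1.0


def excess_loss(a):
    """L(a) - L_0 = a^4 / 2."""
    return 0.5 * a**4


def kl_p0_pa(a):
    """D(p_0 || p_a) = a^4 / 2 - h(a)."""
    return 0.5 * a**4 - h(a)


# Ratio that a renormalizable pair would keep bounded away from zero
ratios = [(a, excess_loss(a) / kl_p0_pa(a)) for a in (0.5, 0.1, 0.02, 0.004)]
for a, ratio in ratios:
    print(a, ratio)
```

The printed ratio behaves like $a^2$ for small $a$, in agreement with $D(p_0\|p_a) \cong a^2/2$.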

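The parameter $a$ that minimizes $nH_n(a)$ is denoted by $a^*$. Since $a^* \cong (\xi_1/2\sqrt{n})^{1/3}$, the main order term of $nH_n(a)$ is given by

$$nH_n(a) \cong \frac{1}{2}\,nH_n(a^*)''\,(a - a^*)^2 + nH_n(a^*)$$
$$\quad \cong \frac{1}{2}\,C_n\,(a - D_n)^2 - \frac{1}{4}\,C_n D_n^2,$$

where

$$C_n = 6n\Big(\frac{\xi_1}{2\sqrt{n}}\Big)^{2/3}, \qquad D_n = \Big(\frac{\xi_1}{2\sqrt{n}}\Big)^{1/3}.$$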
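
The quality of this quadratic approximation can be checked numerically for a fixed realization of $(\xi_1, \xi_2)$. The Python sketch below locates the stationary point of $nH_n(a)$ by bisection on its derivative and compares it with $D_n$, and compares the exact minimum with $-C_n D_n^2/4$. The values of $n$, $\xi_1$, $\xi_2$, the search interval, and the function names are assumptions chosen only for illustration, with $\xi_1 > 0$.

```python
import math


def h(a):
    return math.sqrt(a**4 - a**2 + 1.0) - 1.0


def h_prime(a):
    return (2.0 * a**3 - a) / math.sqrt(a**4 - a**2 + 1.0)


def hamiltonian(a, n, xi1, xi2):
    """n H_n(a) = n a^4 / 2 - sqrt(n) a xi1 - sqrt(n) h(a) xi2."""
    return 0.5 * n * a**4 - math.sqrt(n) * a * xi1 - math.sqrt(n) * h(a) * xi2


def hamiltonian_prime(a, n, xi1, xi2):
    return 2.0 * n * a**3 - math.sqrt(n) * xi1 - math.sqrt(n) * h_prime(a) * xi2


def stationary_point(n, xi1, xi2, lo=0.0, hi=1.0, iters=200):
    """Bisection on the derivative; assumes a sign change on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if hamiltonian_prime(mid, n, xi1, xi2) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


n, xi1, xi2 = 10**8, 1.0, 0.5  # illustrative values only
a_star = stationary_point(n, xi1, xi2)
d_n = (xi1 / (2.0 * math.sqrt(n))) ** (1.0 / 3.0)
c_n = 6.0 * n * d_n**2
min_exact = hamiltonian(a_star, n, xi1, xi2)
min_quadratic = -0.25 * c_n * d_n**2
print(a_star, d_n, min_exact, min_quadratic)
```

The minimizer scales as $n^{-1/6}$ rather than $n^{-1/2}$, which is the source of the $n^{-2/3}$ learning curves in this example.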
Therefore, by using $E[|\xi_1|^{\mu - 1}] = 2^{\mu/2}\,\Gamma(\mu/2)/\sqrt{2\pi}$,

Q 1 



<(0) 



2 n 2 /3 : 



0» = -« 1 



where 

$$Q = \frac{2^{7/6}}{\sqrt{2\pi}}\,\Gamma\Big(\frac{7}{6}\Big).$$



The asymptotic behaviors of Bayes observables are different from the universal law, 

m] * L o-(^ + i)-Jj, (37) 

E[G g ] - Lo + | ~ 3 , (38) 

E[G t ) - (39) 

* ^M^y'J^ (40) 

Also in this case, the conditions of learnability are satisfied with index $\gamma = 2/3$, hence the equations of states hold with $\gamma = 2/3$; however,

$$E[V_t] = nE[Y_t] \propto n^{1/3}$$

does not converge to a constant. It seems that both renormalizable and nonrenormalizable statistical problems satisfy a more general universal law.



6 Discussion 



In this section, let us discuss three points, birational invariants, renormalizability, and Bayes 
observables as random variables. 



6.1 Birational Invariants 

In Section 4, we proved that, in the renormalizable case, the asymptotic learning curves are determined by $\lambda$ and $\nu$, which are defined by using resolution of singularities. Let us study the mathematical properties of them. For a given analytic function $L(w) - L_0$, there exist infinitely many desingularization pairs $(\mathcal{M}, g)$. If a value defined by using $(\mathcal{M}, g)$ does not depend on the choice of $(\mathcal{M}, g)$, then it is called a birational invariant.

Firstly, as is shown in [TJ [9], the value $(-\lambda)$ is equal to the largest pole of the zeta function on $\mathbb{C}$ obtained by the analytic continuation of

$$\zeta(z) = \int (L(w) - L_0)^z\,\varphi(w)\,dw \quad (\mathrm{Re}(z) > 0).$$

Therefore, $\lambda$ is a birational invariant. This value is well known in algebraic geometry and algebraic analysis, where it characterizes the relative relation of the pair of two algebraic varieties $(W, W_0)$
PZ1 EDI 12311231123 ESI ■ 
Secondly, the value $\nu$ is characterized by

$$\nu = \lim_{n\to\infty} \frac{\beta}{2}\,E\Big[\sum_{i=1}^{n} \{\langle (\log p(X_i|w))^2 \rangle - \langle \log p(X_i|w) \rangle^2\}\Big].$$

Hence $\nu$ is also a birational invariant.

It was clarified that, if a true distribution is unrealizable by and regular for a parametric model, then

$$\lambda = \frac{d}{2}, \qquad \nu = \frac{\mathrm{tr}(IJ^{-1})}{2},$$

where $I$ and $J$ are respectively $d \times d$ matrices defined by

$$I = E_X[\nabla \log p(X|w_0)\,\nabla \log p(X|w_0)^{\top}],$$
$$J = \nabla^2 L(w_0).$$
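
For a one-parameter model the formula $\nu = \mathrm{tr}(IJ^{-1})/2$ reduces to $\nu = I/(2J)$. The Python sketch below evaluates it for an illustrative case that is not taken from the paper: the model is $N(a, 1)$ and the true distribution is $N(0, \sigma^2)$, which is unrealizable when $\sigma \neq 1$ and regular, with $w_0 = 0$ and $J = 1$. The quadrature helper and all names are assumptions made for this illustration.

```python
import math


def expect_under_q(func, sigma, half_width=12.0, steps=20000):
    """Trapezoidal approximation of E[func(X)] for X ~ N(0, sigma^2)."""
    lo = -half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * func(x) * math.exp(-0.5 * (x / sigma) ** 2) / norm * dx
    return total


def regular_invariants(sigma):
    """Return (lambda, nu) for the model N(a, 1) under the truth N(0, sigma^2)."""
    a0 = 0.0  # minimizer of L(a) for this model
    fisher_i = expect_under_q(lambda x: (x - a0) ** 2, sigma)  # E[(d/da log p)^2]
    hessian_j = 1.0  # second derivative of L(a) for this model
    lam = 0.5  # d / 2 with d = 1
    nu = 0.5 * fisher_i / hessian_j  # tr(I J^{-1}) / 2
    return lam, nu


lam_unit, nu_unit = regular_invariants(1.0)  # realizable case, sigma = 1
lam_wide, nu_wide = regular_invariants(2.0)  # unrealizable case, sigma = 2
print(lam_unit, nu_unit, lam_wide, nu_wide)
```

In the realizable case $\sigma = 1$ the two invariants coincide, whereas for $\sigma = 2$ the singular fluctuation $\nu = \sigma^2/2$ exceeds $\lambda$.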

For singular and realizable cases, $\lambda$ was calculated in [3] [4] [5] [10], whereas $\nu$ is unknown.

6.2 Renormalizability 

Let us discuss the renormalizable condition. 

Firstly, we study the renormalizable condition from the physical point of view. In physics, a set of functions $\{f_n(x);\ n = 1, 2, \ldots\}$ is sometimes called renormalizable if there exists some rescaling transform by which a universal law is discovered. For example, if there exist both a set $(a, b)$ and a function $f^*(x)$ such that

$$\lim_{n\to\infty} n^a f_n(n^b x) = f^*(x),$$

then such a system is called renormalizable. In this paper, we have shown that, if $(q(x), p(x|w))$ is renormalizable, then the Boltzmann distribution satisfies the convergence in law,

$$\frac{n^{\lambda}}{(\log n)^{m-1}} \exp(-n\beta H_n(g(u))) \to \int_0^\infty t^{\lambda - 1} \exp(-\beta t + \beta\sqrt{t}\,\xi(u))\,dt,$$

when $n$ tends to infinity, where $\xi(u)$ is a gaussian process defined by the central limit theorem of the functional space. If $(q(x), p(x|w))$ does not satisfy the renormalizable condition, then such a rescaling transform does not exist in general. The expectation and the variance of

$$nH_n(w) = \sum_{i=1}^{n} f(X_i, w)$$

are respectively given by

$$E[nH_n(w)] = nE_X[f(X, w)], \qquad V[nH_n(w)] = nV_X[f(X, w)].$$

Because $E_X[f(X, w)] = L(w) - L_0 \geq 0$ and $V_X[f(X, w)] \cong (1/2)\,D(p_0\|p_w)$ in the neighborhood of $L(w) - L_0 = 0$, the renormalizable condition ensures that the fluctuation of the random Hamiltonian is bounded by the average one. This is the intuitive reason why the universal law holds.

Secondly, we study the scale invariance of renormalizability. Let $f_1(x, w)$ and $f_2(x, w)$ be log density ratio functions of two different statistical problems. If they are renormalizable and satisfy the relations

$$E_X[f_1(X, w)] = E_X[f_2(X, w)],$$
$$E_X[f_1(X, w) f_1(X, w')] = E_X[f_2(X, w) f_2(X, w')],$$

then they have the same birational invariants $(\lambda, \nu)$. In other words, the learning curves are determined only by the average and covariance of the log density ratio function. It might seem that $E_X[f(X, w)]^2 \propto E_X[f(X, w)^2]$, but such a relation does not hold even in a trivial case. In a realizable and regular case with $a \in \mathbb{R}^1$,

$$p(x|a) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}(x - a)^2\Big)$$

and $q(x) = p(x|0)$, we have $f(x, a) = a^2/2 - ax$, so that $E_X[f(X, a)] = a^2/2$ and $E_X[f(X, a)^2] = a^2 + a^4/4$. Therefore, in the neighborhood of $a = 0$, both $E_X[f(X, a)]$ and $E_X[f(X, a)^2]$ are in proportion to $a^2$. The renormalizable condition in this case is invariant under a scaling transform $f(X, w) \to s f(X, w)$ for an arbitrary constant $s > 0$. The renormalizable condition of this paper is a generalized concept of such invariance.
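
The two moments in this example can be confirmed by numerical integration. The Python sketch below computes $E_X[f(X, a)]$ and $E_X[f(X, a)^2]$ for $f(x, a) = a^2/2 - ax$ under the standard normal distribution and prints the ratios of the first moment, and of its square, to the second moment. The quadrature helper, the grid parameters, and the function names are assumptions made for this illustration.

```python
import math


def gaussian_expectation(func, half_width=10.0, steps=4000):
    """Trapezoidal approximation of E[func(X)] for X ~ N(0, 1)."""
    dx = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        x = -half_width + i * dx
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * func(x) * math.exp(-0.5 * x * x) * dx
    return total / math.sqrt(2.0 * math.pi)


def moments(a):
    """Return (E[f], E[f^2]) for f(x, a) = a^2 / 2 - a x."""

    def f(x):
        return 0.5 * a * a - a * x

    mean = gaussian_expectation(f)
    second = gaussian_expectation(lambda x: f(x) ** 2)
    return mean, second


for a in (0.5, 0.1, 0.01):
    mean, second = moments(a)
    # mean / second tends to 1/2, while mean**2 / second tends to 0
    print(a, mean / second, mean**2 / second)
```

The first ratio stays close to $1/2$ as $a \to 0$, while the second vanishes like $a^2/4$, which is the distinction drawn in the text between $E_X[f]$ and $E_X[f]^2$.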

6.3 Bayes Observables as Random Variables 

In statistical learning theory, Bayes observables are random variables. In this paper, we mainly 
studied the expectation values of them. Note that the generating function F n (a) does not have 
sufficient information about randomness of Bayes observables. If a true distribution is regular 
or realizable, then stochastic properties of Bayes observables were clarified [TUl E] . It is a future 
study to clarify the stochastic behavior of Bayes observables as random variables. 

7 Conclusion 

In this paper, we defined the renormalizable condition of a learning system, and proved that, in the renormalizable case, the universal law holds. We also showed that, in the nonrenormalizable case, the universal law does not hold in general. It is a future study to clarify a more general universal learning theory, which contains both renormalizable and nonrenormalizable statistical problems.